In this assignment, we explore, analyze, and model a data set of approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is binary: a "1" indicates that the customer was in a car crash, while a "0" indicates that they were not. The second, TARGET_AMT, is 0 if the customer did not crash their car; if they did, it takes a value greater than 0.
The objective is to build multiple linear regression and binary logistic regression models on the training data to predict whether a customer will crash their car and, in the event of a crash, the cost. We will use only the variables given to us (or variables derived from them).
Below is a short description of the variables of interest in the data set:
data_train <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_4/insurance_training_data.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_4/insurance-evaluation-data.csv", header = TRUE)

The dataset consists of 26 variables and 8161 observations, with the AGE, YOJ, and CAR_AGE variables containing some missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Also, 14 of the variables are encoded as factors (discrete values) and the rest are numeric.
| Name | data_train |
|---|---|
| Number of rows | 8161 |
| Number of columns | 26 |
| Column type frequency: | |
| factor | 14 |
| numeric | 12 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| INCOME | 0 | 1 | FALSE | 6613 | $0: 615, emp: 445, $26: 4, $48: 4 |
| PARENT1 | 0 | 1 | FALSE | 2 | No: 7084, Yes: 1077 |
| HOME_VAL | 0 | 1 | FALSE | 5107 | $0: 2294, emp: 464, $11: 3, $11: 3 |
| MSTATUS | 0 | 1 | FALSE | 2 | Yes: 4894, z_N: 3267 |
| SEX | 0 | 1 | FALSE | 2 | z_F: 4375, M: 3786 |
| EDUCATION | 0 | 1 | FALSE | 5 | z_H: 2330, Bac: 2242, Mas: 1658, <Hi: 1203 |
| JOB | 0 | 1 | FALSE | 9 | z_B: 1825, Cle: 1271, Pro: 1117, Man: 988 |
| CAR_USE | 0 | 1 | FALSE | 2 | Pri: 5132, Com: 3029 |
| BLUEBOOK | 0 | 1 | FALSE | 2789 | $1,: 157, $6,: 34, $5,: 33, $6,: 33 |
| CAR_TYPE | 0 | 1 | FALSE | 6 | z_S: 2294, Min: 2145, Pic: 1389, Spo: 907 |
| RED_CAR | 0 | 1 | FALSE | 2 | no: 5783, yes: 2378 |
| OLDCLAIM | 0 | 1 | FALSE | 2857 | $0: 5009, $1,: 4, $1,: 4, $4,: 4 |
| REVOKED | 0 | 1 | FALSE | 2 | No: 7161, Yes: 1000 |
| URBANICITY | 0 | 1 | FALSE | 2 | Hig: 6492, z_H: 1669 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 0 | 1.00 | 5151.87 | 2978.89 | 1 | 2559 | 5133 | 7745 | 10302.0 | ▇▇▇▇▇ |
| TARGET_FLAG | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 | ▇▁▁▁▃ |
| TARGET_AMT | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 | ▇▁▁▁▁ |
| KIDSDRIV | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 | ▇▁▁▁▁ |
| AGE | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 | ▁▆▇▂▁ |
| HOMEKIDS | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 | ▇▂▁▁▁ |
| YOJ | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 | ▂▃▇▃▁ |
| TRAVTIME | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 | ▇▇▁▁▁ |
| TIF | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 | ▇▆▁▁▁ |
| CLM_FREQ | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 | ▇▂▁▁▁ |
| MVR_PTS | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 | ▇▂▁▁▁ |
| CAR_AGE | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 | ▆▇▇▃▁ |
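The tables above follow the output format of the skimr package. Assuming skimr is installed, a summary like this can be generated with:

```r
# Generate grouped summary tables (factor and numeric) for the training data
library(skimr)
skim(data_train)
```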
The currency notation found in some values will interfere with our analysis, so we need to reformat those values appropriately.
strip_dollars <- function(x){
  # Remove "$" and "," from currency strings and convert to numeric
  x <- as.character(x)
  x <- gsub(",", "", x)
  x <- gsub("\\$", "", x)
  as.numeric(x)
}
fix_formatting <- function(messy_df){
  messy_df %>%
    rowwise() %>%
    mutate(INCOME = strip_dollars(INCOME),
           HOME_VAL = strip_dollars(HOME_VAL),
           BLUEBOOK = strip_dollars(BLUEBOOK),
           OLDCLAIM = strip_dollars(OLDCLAIM)) %>%
    ungroup()
}

We noticed that a few variables listed as discrete have large numbers of unique values. A closer inspection of the variable descriptions reveals that while these variables are encoded as factors, they are actually continuous. The TARGET_FLAG variable also appears in the summary as a numeric variable, but it should be a binary factor. We proceed to fix these data types.
Also, some values seem invalid (e.g., a CAR_AGE of -3). Since the variables with missing values each have less than 5% of their values missing, we can replace the missing values with the median. We will compute the medians on the training set only and use them to impute in both the training and testing sets, to avoid leaking information from the test set.
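A sketch of the median imputation described above (the helper name impute_missing is our own; clamping negative CAR_AGE values to 0 is an assumption about how the invalid values are handled):

```r
# Medians computed on the training set only
yoj_med     <- median(data_train$YOJ, na.rm = TRUE)
car_age_med <- median(data_train$CAR_AGE, na.rm = TRUE)
age_med     <- median(data_train$AGE, na.rm = TRUE)

# Replace missing values with the training medians; clamp invalid negatives
impute_missing <- function(df){
  df %>%
    mutate(AGE     = ifelse(is.na(AGE), age_med, AGE),
           YOJ     = ifelse(is.na(YOJ), yoj_med, YOJ),
           CAR_AGE = ifelse(is.na(CAR_AGE), car_age_med, pmax(CAR_AGE, 0)))
}
```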
We apply the processing steps above to both the training and testing datasets.
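As a sketch, the currency clean-up can be applied to both sets in the same way (any further type fixes and imputation steps would follow the same pattern):

```r
# Apply the formatting fix to both the training and evaluation data
data_train <- fix_formatting(data_train)
data_test  <- fix_formatting(data_test)
```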
We now explore the distribution of TARGET_FLAG across the numeric variables. BLUEBOOK, INCOME, and OLDCLAIM have a high number of outliers compared to the other variables. We also see that customers who are older, or who have older cars, higher home values, or higher incomes, tend to get into fewer car crashes. However, people with motor vehicle record points or a high number of old claims tend to get into more accidents.
plot_vars <- c("TARGET_FLAG", names(keep(data_train, is.numeric)))
data_train[plot_vars] %>%
select(-INDEX, -TARGET_AMT) %>%
gather(variable, value, -TARGET_FLAG) %>%
ggplot(., aes(TARGET_FLAG, value, color=TARGET_FLAG)) +
geom_boxplot() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())

data_train %>%
select(-TARGET_FLAG, -TARGET_AMT, -INDEX) %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(x= value)) +
geom_histogram(stat = 'count',fill='pink') +
facet_wrap(~key, scales = 'free')

The variables displayed below, such as OLDCLAIM, INCOME, BLUEBOOK, and HOME_VAL, need scale transformations. Several variables have a high number of zeros. AGE is the only variable that is normally distributed; the rest show some skewness. We will perform a Box-Cox transformation on these variables.
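As a sketch of that transformation step (using caret's preProcess is our assumption; because these variables contain many zeros and Box-Cox requires strictly positive values, the closely related Yeo-Johnson method is shown instead):

```r
library(caret)
skewed_vars <- c("OLDCLAIM", "INCOME", "BLUEBOOK", "HOME_VAL")
# Box-Cox needs strictly positive values; Yeo-Johnson also handles zeros
pre <- preProcess(data_train[, skewed_vars], method = "YeoJohnson")
data_train[, skewed_vars] <- predict(pre, data_train[, skewed_vars])
```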
data_train %>%
select(OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) %>%
gather() %>%
ggplot(aes(x= value)) +
geom_histogram(fill='pink') +
facet_wrap(~key, scales = 'free')

data_train %>%
keep(is.numeric) %>%
gather(variable, value, -TARGET_AMT, -INDEX, -CLM_FREQ, -HOMEKIDS, -KIDSDRIV, -MVR_PTS) %>%
ggplot(., aes(value, TARGET_AMT)) +
geom_point() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank())

We see that MVR_PTS, CLM_FREQ, and OLDCLAIM are the variables most positively correlated with our response variables, whereas URBANICITY is the most negatively correlated. The rest of the variables are weakly correlated.
corr_dataframe = data_train %>%
mutate_if(is.factor, as.numeric)
q <- cor(corr_dataframe)
ggcorrplot(q, type = "upper", outline.color = "white",
ggtheme = theme_classic,
colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3) set.seed(42)
accidents <- data_train %>%
filter(TARGET_FLAG == 1)
ggplot(accidents, aes(x=TARGET_AMT)) +
geom_density(fill='pink') +
theme_light() +
geom_vline(aes(xintercept = mean(TARGET_AMT)), lty=2, col="red") +
geom_label(aes(x=25000, y=0.00015, label=paste("mean =", round(mean(TARGET_AMT),0)))) +
geom_vline(aes(xintercept = median(TARGET_AMT)), lty=2, col="darkgreen") +
geom_label(aes(x=25000, y=0.00010, label=paste("median = ", round(median(TARGET_AMT), 0)))) +
labs(title="TARGET_AMT Density Plot", y="Density", x="TARGET_AMT")

As previously noted, this distribution has a long tail. The mean payout is $5,616 and the median is $4,102. The mean and median are, of course, higher for the observations we classify as outliers. The outlier cutoff point is $10,594.
# The cutoff is the smallest crash payout flagged as an outlier by the boxplot rule
outlier <- min(boxplot(data_train[data_train$TARGET_FLAG==1,]$TARGET_AMT, plot=FALSE)$out)
data_train %>%
mutate(TARGET_AMT_OUTLIER = ifelse(TARGET_AMT < outlier, "Yes", "No")) %>%
group_by(TARGET_AMT_OUTLIER) %>%
summarise(Mean = mean(TARGET_AMT),
Median = median(TARGET_AMT))
There is an imbalance in the TARGET_FLAG variable. Let's check the class distribution:

   0    1
6008 2153

        0         1
0.7361843 0.2638157

Only about 26% of the customers in the data have had an accident, while 74% have not. This is a severely imbalanced data set, which would distort the accuracy score in the model building step if left untreated.
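The counts and proportions above can be obtained with base R:

```r
# Class counts and proportions for the binary response
table(data_train$TARGET_FLAG)
prop.table(table(data_train$TARGET_FLAG))
```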
To treat this imbalance, we will oversample the minority class.
set.seed(42)
minority <- nrow(data_train[data_train$TARGET_FLAG == 1,])
majority <- nrow(data_train[data_train$TARGET_FLAG == 0,])
diff <- majority - minority
minority_index <- data_train[data_train$TARGET_FLAG == 1,]$INDEX
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
merge(data_train, .) %>%
bind_rows(data_train)

Check the balance again:
0 1
6008 6008
# Initialize a df that will store the metrics of models
# (tibble() takes no stringsAsFactors argument; the precision column is
# named to match the metrics collected below)
models.df <- tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),
                    aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),
                    precision=numeric())

# A function to extract the relevant metrics from the summary and confusion matrix
build_model <- function(id, formula, data) {
  glm.fit <- glm(formula, data=data, family=binomial)
  print(summary(glm.fit))
  glm.probs <- predict(glm.fit, type="response")
  # Classify using the 0.5 threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target=data$TARGET_FLAG, pred=glm.pred) %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  # Compute the confusion matrix once and extract the metrics from it
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  metrics <- list(res.deviance = glm.fit$deviance,
                  null.deviance = glm.fit$null.deviance,
                  aic = glm.fit$aic,
                  accuracy = cm$overall['Accuracy'],
                  sensitivity = cm$byClass['Sensitivity'],
                  specificity = cm$byClass['Specificity'],
                  precision = cm$byClass['Precision'])
  metrics <- lapply(metrics, round, 3)
  plot(roc(results$target.class, glm.probs), print.auc = TRUE)
  model.df <- tibble(id=id, formula=formula, res.deviance=metrics$res.deviance, null.deviance=metrics$null.deviance,
                     aic=metrics$aic, accuracy=metrics$accuracy, sensitivity=metrics$sensitivity,
                     specificity=metrics$specificity, precision=metrics$precision)
  model.list <- list(model=glm.fit, df_info=model.df)
  return(model.list)
}
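A hypothetical usage sketch of build_model() on the oversampled training data (the predictors in the formula below are illustrative, not the final model specification):

```r
# Fit a baseline logistic model and collect its metrics (sketch)
model1 <- build_model("baseline",
                      "TARGET_FLAG ~ MVR_PTS + CLM_FREQ + OLDCLAIM + URBANICITY",
                      over_sample_train)
models.df <- bind_rows(models.df, model1$df_info)
```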